Improving Cross-Modal Attention Via Object Detection